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We model the evolution of eukaryotic protein-protein interaction (PPI) networks. In our model, 
PPI networks evolve by two known biological mechanisms: (1) Gene duplication, which is followed 
by rapid diversification of duplicate interactions. (2) Neofunctionalization, in which a mutation leads 
to a new interaction with some other protein. Since many interactions are due to simple surface 
compatibility, we hypothesize there is an increased likelihood of interacting with other proteins in 
the target protein’s neighborhood. We find good agreement of the model on 10 different network 
properties compared to high-confidence experimental PPI networks in yeast, fruit flies, and humans. 
Key findings are: (1) PPI networks evolve modular structures, with no need to invoke particular 
selection pressures. (2) Proteins in cells have on average about 6 degrees of separation, similar to 
some social networks, such as human-communication and actor networks. (3) Unlike social networks, 
which have a shrinking diameter (degree of maximum separation) over time, PPI networks are 
predicted to grow in diameter. (4) The model indicates that evolutionarily old proteins should have 
higher connectivities and be more centrally embedded in their networks. This suggests a way in 
which present-day proteomics data could provide insights into biological evolution. 


We are interested in the evolution of protein-protein 
interaction (PPI) networks. PPI network evolution ac¬ 
companies cellular evolution, and may be important for 
processes such as the emergence of antibiotic resistance in 
bacteria mm, the growth of cancer cells [3], and biolog¬ 
ical speciation BHS]. In recent years, increasingly large 
volumes of experimental PPI data have become avail¬ 
able um, and a variety of computational techniques 
have been created to process and analyze these data na¬ 
ns]. Although these techniques are diverse, and the ex¬ 
perimental data are noisy [19] , a general picture emerg¬ 
ing from these studies is that the evolutionary pressures 
shaping protein networks are deeply interlinked with the 
networks’ topology [20]. Our aim here is to construct 
a minimal model of PPI network evolution which accu¬ 
rately captures a broad panel of topological properties. 

In this work, we describe an evolutionary model for eu¬ 
karyotic PPI networks. In our model, protein networks 
evolve by two known biological mechanisms: (1) a gene 
can duplicate, putting one copy under new selective pres¬ 
sures that allow it to establish new relationships to other 
proteins in the cell, and (2) a protein undergoes a muta¬ 
tion that causes it to develop new binding or new func¬ 
tional relationships with existing proteins. In addition, 
we allow for the possibility that once a mutated protein 
develops a new relationship with another protein (called 
the target), the mutant protein can also more readily es¬ 
tablish relationships with other proteins in the target’s 
neighborhood. One goal is to see if random changes based 
on these mechanisms could generate networks with the 
properties of present-day PPI networks. Another goal is 
then to draw inferences about the evolutionary histories 
of PPI networks. 
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THE MODEL 

We represent a PPI network as a graph. Each node on 
the graph represents one protein. A link (edge) between 
two nodes represents a physical interaction between the 
two corresponding proteins. The links are undirected and 
unweighted. To model the evolution of the PPI graph, we 
simulate a series of steps in time. At time t, one protein 
in the network is subjected to either a gene duplication 
or a neofunctionalizing mutation, leading to an altered 
network by time t + At. We refer to this model as the 
DUNE (Duplication & NEofunctionalization) model. 


Gene duplication 

One mechanism by which PPI networks change is gene 
duplication (DU) [2111231 . In DU, an existing gene is 
copied, creating a new, identical gene. In our model, 
duplications occur at a rate d, which is assumed to be 
constant for each organism. All genes are accessible to 
duplication, with equal likelihood. For simplicity, we as¬ 
sume that one gene codes for one protein. One of the 
copies continues to perform the same biological function 
and remains under the same selective pressures as be¬ 
fore. The other copy is superfluous, since it is no longer 
essential for the functioning of the cell [ 23 ] . 

The superfluous copy of a protein/gene is under less 
selective pressure; it is free to lose its previous function 
and to develop some other function within the cell. Due 
to this reduced selective pressure, further mutations to 
the superfluous protein are more readily accepted, in¬ 
cluding those that would otherwise have been harmful 
to the organism [23 ES] • Hence, a superfluous protein 
diverges rapidly after its DU event [27l|28]. This well- 
known process is referred to as the post-duplication di- 
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Simulation complete 


FIG. 1. DUNE model flowchart. At each time step, the 
simulated network undergoes a duplication or neofunction¬ 
alization event. Red nodes/links indicate nodes/links that 
have been created by duplication during the current time 
step. Green links indicate links that have been created by 
neofunctionalization during the current time step. A dashed 
line indicates a duplicated link that has been deleted during 
the post-duplication divergence. Only 3 neighbors are shown 
for the assimilation mechanism; however, the actual simula¬ 
tions included up to 20th neighbors. The simulated network 
evolves until its number of links ( K ) meets or exceeds the 
number of links in the data (Adata). 


vergence. Following |29j , we assume that the link of each 
such superfluous protein/gene to its former neighbors is 
deleted with probability <f>. The post-duplication diver¬ 
gence tends to be fast; for simplicity, we assume the di¬ 
vergence occurs within the same time step as the DU. 
The divergence is asymmetric HUE]: one of the pro¬ 
teins diversifies rapidly, while the other protein retains 
its prior activity. We delete links from the original or 
the duplicate with equal probability because the proteins 
are identical. As discussed in the supporting information 
(SI), this is closely related to the idea of subfunctionaliza¬ 
tion, where divergence freely occurs until redundancy is 
eliminated. In our model, cf> is an adjustable parameter. 

In many cases, the post-duplication divergence results 
in a protein which has lost all its links. These ‘orphan’ 
proteins correspond to silenced or deleted genes in our 
model. As discussed below, our model predicts that the 
gene loss rate should be slightly higher than the duplica¬ 
tion rate in yeast, and slightly lower in flies and humans. 

We simulate a gene duplication event at time t as fol¬ 
lows: 


la) Duplicate a randomly-chosen gene with probability 
dAt. 

2a) Choose either the original (50%) or duplicate 
(50%), and delete each of its links with probability </>. 

3a) Move on to the next time interval, time t + At. 

Neofunctionalization 

Our model also takes into account that DNA can be 
changed by random mutations. Most such mutations do 
not lead to changes in the PPI network structure. How¬ 
ever, some protein mutations lead to new interactions 
with some other protein (which we call the target pro¬ 
tein). The formation of a novel interaction is called a 
neofunctionalization (NE) event. NE refers to the cre¬ 
ation of new interactions, not to the disappearance of 
old ones. Functional deletions tend to be deleterious to 
organisms [32] . We do not account for loss-of-function 
mutations (link deletions) except during post-duplication 
divergence because damaged alleles will, in general, be 
eliminated by purifying selection. In our model, NE mu¬ 
tations occur at a rate /z, which is assumed to be con¬ 
stant. All proteins are equally likely to be mutated. 

How does the mutated protein choose a target protein 
to which it links? We define a probability q that any pro¬ 
tein in the network is selected for receiving the new link 
from the mutant protein. To account for the possibility 
of homodimerization, the mutated protein may also link 
to itself [241 [33]. Random choice dictates that q = 1/N 
(see SI). 

Many PPI’s are driven by a simple geometric com¬ 
patibility between the surfaces of the proteins [34]. The 
simplest example is the case of PPI’s between flat, hy¬ 
drophobic surfaces |35| . a type of interaction which is 
very common [361 . These PPI’s have a simple planar in¬ 
terface, and the binding sites on the individual proteins 
are geometrically quite similar to one another. One con¬ 
sequence of these similar-surface interactions is that if 
protein A can bind to proteins B and C, then there is a 
greater-than-random chance that B and C will interact 
with each other. We refer to this property as transitiv¬ 
ity: if A binds B, and A binds C, then B binds C. The 
number of triangles in the PPI network should correlate 
roughly with transitivity. As discussed below, the num¬ 
ber of triangles (as quantified by the global clustering 
coefficient) is about 45 times higher in real PPI networks 
than in an equally-dense random graph. This suggests 
that transitivity is quite common in PPI networks. An¬ 
other source of transitivity is gene duplication. If A binds 
B, then A is copied to create a duplicate protein A’, then 
A’ will (initially) also bind B. If A interacts with A’, then 
a triangle exists. However, duplication is unlikely to be 
the primary source of transitivity; recent evidence shows 
that, due to the post-duplication divergence, duplicates 
tend to participate in fewer triangles than other proteins 

El- 

A concrete example of transitivity is provided by the 
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evolution of the retinoic acid receptor (RAR), an example 
of neofunctionalization which has been characterized in 
detail [38]. Three paralogs of RAR exist in vertebrates 
(RARa, /3, and 7 ), as a result of an ancient duplica¬ 
tion. The interaction profiles of these proteins are quite 
different. Previous work indicates that RAR/3 retained 
the role of the ancestral RAR [38], while RARa and 7 
evolved new functionality. RARa has several interac¬ 
tions not found in RAR/3. RARa has novel interactions 
with a histone deacetylase (HDAC3) as well as seven of 
HDAC3’s nearest-neighbors (HDAC4, MBD1, Q15959, 
NRIP1, Q59FP9, NR2E3, GATA2). None of these in¬ 
teractions are found in RAR/1. The probability that all 
of these novel interactions were created independently is 
very low. RARa has 65 known PPI’s and HDAC3 has 83, 
and the present-day size of the human PPI network is a 
little over 3000 proteins. Therefore, the chance of RARa 
randomly evolving novel interactions with 7 of HDAC3’s 
neighbors is less than 1 in a billion. This strongly sug¬ 
gests that when a protein evolves an interaction to a tar¬ 
get, it has a greater-than-random chance of also linking 
to other, neighboring proteins. 

How do similar-surface interactions affect the evolu¬ 
tion of PPI networks? First, consider how an interaction 
triangle would form. Suppose proteins A and B bind due 
to physically similar binding sites. Protein X mutates 
and evolves the capacity to bind A. There is a reason¬ 
able chance that X has a surface which is similar to both 
A and B. If so, protein X is likely to also bind to B, form¬ 
ing a triangle. Denote the probability that two proteins 
interact due to a simple binding site similarity by a. The 
probability that A binds B (and X binds A) in this man¬ 
ner is a. Assuming these probabilities are identical and 
independent, the probability that X binds B is a 2 . 

So far, we have discussed transitivity as it affects the 
PPI’s in which protein A is directly involved (A’s first- 
neighbors). We now introduce a third protein to the 
above example, resulting in a chain of interactions: pro¬ 
tein A binds B, B binds C, but C does not bind A. Protein 
X mutates and gains an interaction with A (with proba¬ 
bility a 2 ). What is the probability that X will also bind 
C? The probability that B binds C due to surface similar¬ 
ity is a. Thus, X will bind C (A’s second-neighbor) with 
probability a 3 . In general, the probability that X will 
bind one of A’s j th neighbors is a J+1 . We refer to this pro¬ 
cess as assimilation , and the ‘assimilation parameter’ a is 
a constant which varies between species. As discussed in 
SI, it is primarily mutliple-partner proteins which bind to 
their partners at different times and/or locations which 
are affected by this process; consequently, at most one 
link is created by assimilation at the first-neighbor level, 
second-neighbor level, etc. Assimilation is assumed to 
act on a much shorter time scale than duplication and 
neofunctionalization; in our model, it is instantaneous. 

Our hypothesized assimilation mechanism makes sev¬ 
eral predictions that could be tested experimentally: ( 1 ) 
the probability of a protein assimilating into a new path¬ 
way should be a 2 (at the first-neighbor level), a 3 (at the 


second-neighbor level), and so on, where a is a constant 
which varies between species; ( 2 ) weak, nonspecific bind¬ 
ing and planar interfaces should be overrepresented in 
interaction triangles (and longer cycles) between non¬ 
duplicate proteins; (3) competitive inhibitors should be 
overrepresented in interaction triangles; and (4) domain 
shuffling should be associated with assimilation. (See SI 
for discussion of (3) and (4).) 

We simulate a neofunctionalization event at time t as 
follows: 

lb) Mutate a randomly-chosen gene with probability 
pAt. 

2b) Link to a randomly-chosen target protein. 

3b) Add a second link to one of the target’s first- 
neighbor proteins, chosen randomly, with probability a 2 . 

4b) Add a link to one of the target’s second-neighbor 
proteins, with probability a 3 , etc. 

5b) Move on to the next time interval, time t + At. 


Model simulation and parameters 

A flowchart of how PPI networks evolve in our model 
is shown in Figure [l] To simulate the network’s evolu¬ 
tion, one of the two mechanisms above is used at each 
time step, using [32]. We call each possible time series 
a trajectory. We begin each trajectory starting from two 
proteins sharing a link (the simplest configuration that 
is still technically a network). Each simulated trajec¬ 
tory ends when the model network has grown to have 
the same total number of links, K , as found in the ex¬ 
perimental data, ATdata- Here, we perform sets of simula¬ 
tions for three different organisms: yeast ( Saccharomyces 
cerevisiae), fruit flies ( Drosophila melanogaster) , and hu¬ 
mans ( Homo sapiens). Because evolution is stochastic, 
there are different possible trajectories, even for identi¬ 
cal starting conditions and parameters. We simulated 50 
trajectories for each organism. Our figures show the me- 


TABLE I. Network sizes and model parameters 



N data -^-data 

d p (f> a 

Yeast 

Fly 

Human 

2170 3819 
878 1140 

3165 5547 

0.01 7.86 x 10“ 4 0.555 0.690 
0.0014 5.89 x 10“ 4 0.866 0.546 
0.0037 7.62 x 10“ 4 0.652 0.727 


N and K are the numbers of proteins and links, respectively. 
(Kd ata is used to stop the simulation. Adata is not used as a 
constraint.) d and p have units of per gene per million years 
(Myr). <j> and a are probabilities (unitless). Adata and d are 
constraints from the data, while p, <f>, and a are adjustable 
parameters. We used Monte Carlo simulations to optimize 
the parameter values, by minimizing the total symmetric 
mean absolute percentage error values of the simulated 
versus the experimental data (see SI). Our values of p are 
substantially lower than d because p is the rate of mutations 
leading to the creation of a new PPI (rather than being a 
simple mutation rate, which would be much higher). 
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FIG. 2. Degree centrality. Degree (fc) distributions in hu¬ 
man (green), yeast (blue), and fly (red). Heavy lines are the 
median values from 50 simulations, and light lines are results 
of individual simulations. Points represent high-confidence 
empirical data for each organism (see Methods). Unless oth¬ 
erwise noted, color coding in the same in all plots. Quanti¬ 
tative comparisons between simulation and experiment (for 
DUNE and several other models) are detailed in SI. 

dian values of each feature as a heavy line, and individual 
trajectories as light lines. 

For a given data set, the number of links (ATdata) is 
known. We estimate the duplication rate d from litera¬ 
ture values. There have been several empirical estimates 
of duplication rates, mostly falling within an order of 
magnitude of each other [271 140U45] . We averaged to¬ 
gether the literature values to estimate d for each species 

(Table 0- 

The quantity /z is not as well known. Its value relative 
to d has been the topic of considerable debate [Ml HSh 
148] , Although, in principle, p is a measurable quantity, 
it has proven difficult to obtain an accurate value, in 
part because the fixation rate of neofunctionalized alleles 
varies with population size [49j [50] . In the absence of 
a consensus order-of-magnitude estimate, in our model, 
we treat p as a fitting parameter. Consistent with the 
findings of m and , our best-fit values of p are within 
an order of magnitude of each other for yeast, fruit fly, 
and human networks. Best-fit parameter values are given 
in Table [U 


RESULTS 

Present-day network topology 

One test of an evolutionary model is its predictions 
for present-day PPI network topologies. Current large- 
scale PPI data sets have a high level of noise, resulting 
in significant problems with false positives and negatives 


m mi. To mitigate this, we compare only to ‘high- 
confidence’ experimental PPI network data gathered in 
small-scale experiments (see Methods). We computed 10 
topological features, quantifying various static and dy¬ 
namic aspects of the networks’ global and local struc¬ 
tures: degree, closeness, eigenvalues, betweenness, mod¬ 
ularity, diameter, error tolerance, largest component size, 
clustering coefficients, and assortativity. 8 of these prop¬ 
erties are described below (see SI for others). 

The degree k of a node is the number of links con¬ 
nected to it. For protein networks, a protein’s degree is 
the number of proteins with which it has direct interac¬ 
tions. Some proteins interact with few other proteins, 
while other proteins (called ‘hubs’) interact with many 
other proteins. Previous work indicates that hubs have 
structural and functional characteristics that distinguish 
them from non-hubs, such as increased proportion of dis¬ 
ordered surface residues and repetitive domain structures 
[56j . The high degree of a protein hub could indicate 
that protein has unusual biological significance [57]. The 
network’s overall link density is described by its mean 
degree, ( k ) (Table |H|. The degree distribution p(k) is the 
probability that a protein will have k links. PPI net¬ 
works have a few hub proteins and many relatively iso¬ 
lated proteins. The heavy tail of the degree distribution 
shows that PPI networks have significantly more hubs 
than random networks have. Simulated and experimen¬ 
tal degree distributions are compared in Figure [2j (For 
quantitative comparisons, see SI.) 

Component refers to a set of reachable proteins. If any 
protein is reachable from any other protein (by hopping 
from neighbor to neighbor), then the network only has 
one component. If there is no path leading from protein 
A to B, then A and B are in different components. The 



Q 

D 

h 

(C) 

(k) 

Yeast data 

0.75 

15 

0.89 

0.09 

3.65 

DUNE 

0.74(7) 

17(6) 

0.8(1) 

0.041(9) 

4.0(8) 

Vazquez 

0.80(4) 

21(5) 

0.2(1) 

0.045(5) 

2.6(4) 

Berg 

0.518(4) 

12.0(7) 0.990(3) 0.0027(9) 4.10(3) 

RG 

0.910(3) 

36(3) 

0.987(6) 

0.475(8) 

5.31(8) 

MpK 

0.58(6) 

24(5) 

1.000(2) 

0.08(3) 

4.4(6) 

ER 

0.588(8) 

13.0(9) 0.995(2) 

0.002(1) 

3.5(6) 

Fly data 

0.86 

23 

0.73 

0.10 

2.93 

DUNE 

0.82(2) 

20(2) 

0.81(3) 

0.09(1) 

2.36(9) 

Human data 

0.75 

15 

0.88 

0.08 

3.69 

DUNE 

0.74(6) 

17(2) 

0.88(4) 

0.09(1) 

3.7(4) 


TABLE II. Comparison of network features. Modularity 
Q, diameter D, fraction of nodes in the largest component 
/i, global clustering coefficient (C), and ( k) is the average 
degree of proteins the largest component. ‘Data’ is the em¬ 
pirical data, ‘DUNE’ is the model described here, ‘Vazquez’ 
is the duplication-only model of [22j, ‘Berg’ is the link dy¬ 
namics model [52], ‘RG’ is random geometric [53], ‘MpK’ is 
the physical desolvation model presented in [54], and ‘ER’ is 
an Erdos-Renyi random graph [55]. Simulated values are the 
median (± standard deviation) over 50 simulations. (See SI 
for details of each model’s setup and optimization.) 
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Low average closeness 



High average closeness 



FIG. 3. Closeness centrality. (A) Closeness {£) distributions in human (green), yeast (blue), and fly (red). Heavy lines are 
the median values from 50 simulations, and light lines are results of individual simulations. (B) Examples of networks with 
low average closeness (£) = 0.06 (top; each node is generally far away from most other nodes because there are no ‘short cuts’) 
and high average closeness (£) = 0.28 (bottom; the random connections allow each node to be only a short distance from the 
other nodes). Note that both networks pictured here have the same number of nodes (N = 100) and roughly the same average 
degree (top: ( k} = 4, bottom: (k) = 3.7). 


fraction of nodes in the largest component (/i) is a mea¬ 
sure of network fragmentation (Table |ll| and Figure |S2j). 
Note that, although silent genes (proteins with no links) 
exist in real systems, these genes do not appear in data 
sets consisting only of PPI’s. Therefore, calculations of 
/i for all models exclude orphan proteins (proteins with 
k = 0). 

Gene loss, the silencing or deletion of genes, is known 
to play an important role in evolution. The loss of a func¬ 


tioning gene will damage an organism, making the gene 
loss unlikely to be passed on. The exception is if the 
gene is redundant. Consistent with this reasoning, evi¬ 
dence suggests that many gene loss events are losses of 
one copy of a duplicated gene [30* i5£j ■ Although empiri¬ 
cal estimates of the gene loss rate varied considerably, a 
consistent finding across several studies is that the rates 
of gene duplication and loss are of the same order-of- 
magnitude [271 El 04]. This broad picture is in good 
agreement with our model. In our model, a gene is con- 
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Betweenness centrality 



b 


FIG. 4. Hierarchical clustering. Median clustering coeffi¬ 
cient vs. degree in human (green), yeast (blue), and fly (red). 
Heavy lines are the median values from 50 simulations, and 
light lines are results of individual simulations. 


FIG. 5. Betweenness centrality. Betweenness (6) distri¬ 
butions in human (green), yeast (blue), and fly (red). Heavy 
lines are the median values from 50 simulations, and light 
lines are results of individual simulations. 
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sidered lost when it has degree zero. Our model predicts 
that the ratio of orphan to non-orphan proteins is 1.6±0.4 
in yeast, 0.58 ± 0.06 in flies, and 0.67 ± 0.09 in humans. 
The gene loss rate has been previously estimated to be 
about half the duplication rate in both flies and humans 
[211113, consistent with our model’s prediction. 

The distance between nodes i and j is defined as the 
number of node-to-node steps that it takes along the 
shortest path to get from node i to j. The closeness cen¬ 
trality of a node i. ti, is the inverse of the average distance 
from node i to all other nodes in the same component. 
The diameter, D, of a network is the longest distance in 
the network. Simulated closeness distributions are com¬ 
pared to experiments in Figure [3] Interestingly, proteins 
have about ‘six degrees of separation’, similar to social 
networks [551150] . The closeness distributions p(£) have 
peaks around 1/^ ss 5 — 7. 

Another property of a network is its modularity |61| . 
Networks are modular if they have high densities of links 
(defining regions called modules), connected by lower 
densities of links (between modules). One way to quan¬ 
tify the extent of modular organization in a network is 
to compute the modularity index, Q |52j, 155] : 

Q= ( A b - S(ui,Uj), (1) 

i,j ^ ' 

where ki and kj are the degrees of nodes i and j, Ui and 
Uj denote the modules to which nodes i and j belong, 
S(iii,Uj ) = 1 if Ui = Uj and 5(ui,Uj ) = 0 otherwise, and 
A,jj = 1 if nodes i and j share a link, and Aij = 0 oth¬ 
erwise. Q quantifies the difference between the actual 
within-module link density to the expected link density 
in a randomly connected network. Q ranges between — 1 


Betweenness vs. degree 
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FIG. 6. Betweenness vs. degree. Shown are median be¬ 
tweenness vs. degree values in human (green), yeast (blue), 
and fly (red). Heavy lines are the median values from 50 sim¬ 
ulations, and light lines are results of individual simulations. 


Assortativity 



k 


FIG. 7. Assortativity. Median nearest-neighbor degree 
vs. degree in human (green), yeast (blue), and fly (red). 
Heavy lines are the median values from 50 simulations, and 
light lines are results of individual simulations. 

and 1; positive values of Q indicate that the number of 
links within modules is greater than random. The numer¬ 
ical value of Q required for a network to be considered 
‘modular’ depends on the number of nodes and links and 
method of computation. To calibrate baseline Q values 
given our particular network data, we used the null model 
described in |64j . Our non-modular baseline values are 
Q = 0.603 for the human PPI net, Q = 0.590 for yeast, 
and Q = 0.722 for flies (see SI). As shown in Table [ll[ 
PPI networks are highly modular, and our simulated Q 
values are in good agreement with those of experimental 
data. 

The clustering coefficient, Ci, for a protein i, is a mea¬ 
sure of mutual connectivity of the neighbors of protein i. 
Ci is defined as the ratio of the actual number of links 
between neighbors of protein i to the maximum possible 
number of links between them, 

# edges between neighbors of node i 


In a PPI network, clustering is thought to reflect the high 
likelihood that proteins of similar function are mutually 
connected [B5i. The average (or global) clustering co¬ 
efficient, (C), quantifies the extent of clustering in the 
network as a whole. As shown in Table |H[ PPI networks 
have large global clustering coefficient values; the yeast 
PPI network, for example, has a value of (C) which is 45 
times higher than that of a random graph of equivalent 
link density. In flies and humans, our simulated networks 
have ( C) values in excellent agreement with the data; in 
yeast, our predicted value is slightly low. 

A network is said to be ‘hierarchically clustered’ if the 
clustering coefficient and deg ree obey a power-law rela¬ 
tion, C ~ k ~£ |M] (Figure 0), indicating that nodes are 
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Modularity Diameter 

J 3(1.-.-.-r- 



FIG. 8. Modularity and diameter. (A) Modularity Q and (B) diameter D are both predicted to grow with time in human 
(green), yeast (blue), and fly (red). Light lines indicate the evolutionary trajectories of 50 individual simulations, and the heavy 
line is the median value. The modularity and diameter of the empirical data are shown as dashed horizontal lines. Time traces 
occasionally do not start at t = 0 because these simulations spend the first few time steps in a completely disconnected state, 
so the dynamical quantities are undefined. (See Figure [S2| for other dynamical plots.) 


organized into small-scale modules, and the small-scale 
modules are in turn organized into larger-scale modules 
following the same pattern [57] , By plotting each node’s 
clustering coefficient against its degree, we observed a 
trend consistent with hierarchical clustering, although 
data in the tail is very limited. 

The betweenness of a node measures the extent to 
which it ‘bridges’ between different modules. Between¬ 
ness centrality , b, is defined as: 

_ # shortest paths passing through node i 
# total shortest paths 

Betweenness has been proposed as a uniquely 
functionally-relevant metric for PPI networks be¬ 
cause it relates local and global topology. It has been 
argued that knocking out a protein that has high 
betweenness may be more lethal to an organism than 
knocking out a protein of high degree [5S] • Betweenness 
distributions are shown in Figure [5j 

If a network’s well-connected nodes are mostly at¬ 
tached to poorly-connected nodes, the network is called 
disassortative. A simple way to quantify disassortativity 
is by determining the median degree of a protein’s neigh¬ 
bors (n) as a function of its degree ( k ). Previous work has 
found that yeast networks are disassortative |53|. It has 
been argued that disassortativity is an essential feature 
of PPI network evolution, and recent modeling efforts 
have heavily emphasized this feature ESI Eg. However, 
it was noted by m that disassortativity may simply be 
an artifact of the yeast two-hybrid technique, and m 
pointed out that this trend is quite different among dif¬ 
ferent yeast datasets, and in some cases is completely 
reversed, resulting in assortative mixing, where high de¬ 
gree proteins prefer to link to other high-degree proteins. 


As shown in Figure [7] and Table [Si] the empirical data 
shows no evidence of disassortativity in flies or humans, 
and even the trend in yeast is quite weak. This conclu¬ 
sion is based solely on analysis of the empirical data, and 
casts further doubt on the role of disassortative mixing 
in PPI network evolution. 

Some higher-order features of the network are simply 
a result of its degree sequence, and other features might 
be important in their own right. As discussed in |6T] . it 
is possible to isolate the effects of the degree sequence by 
‘rewiring’ (detaching then reattaching links) the network 
at random, subject to the restriction that the degree se¬ 
quence must be preserved. If a property contains extra 
information about the network’s structure, then it should 
be different in the rewired network. On the other hand, 
if the network is rewired many times, and the property 
is always the same, then it is likely to just be a result of 
the degree sequence. We randomly rewired the empiri¬ 
cal network 4A'd ata times Q As expected, modularity is 
decreased by random rewiring. Upon rewiring, we find 
Q = 0.603 ± 0.002 in humans, Q = 0.590 ± 0.003 in 
yeast, and Q = 0.722 ±0.007 in flies (median ± standard 
deviation from 50 repeats of the rewiring algorithm). 
Rewiring also shrinks the diameters of PPI networks to 
D = 13 ± 0.9 in humans, D = 12 ± 1.0 in yeast, and 
D = 15 ± 1.1 in flies. These results suggest that these 
features contain important structural information about 
the network, and are not merely consequences of the de¬ 
gree sequence. 


1 Network rewiring was carried out using a script downloaded from 

http: //www. emth. bnl. gov/~maslov/matlab. htm. 






























One reason we are interested in calculating Q and D is 
simply to check that the values are comparable between 
the simulated and experimental networks. However, on 
a more qualitative level, we would also like to have some 
idea of what the threshold is for a network to be consid¬ 
ered ‘modular’ or ‘small-diameter’. The rewired Q and 
D values are useful because these features are dependent 
on the size of the network (number of nodes N and links 
AT); given an identical network construction method, Q 
and D will generally be different in sparse versus dense 
networks. We use these Q and D values as baseline val¬ 
ues with which the experimental and simulated networks 
can be compared; we considered Q and D values differing 
from the rewired values by more than a standard devia¬ 
tion to be significantly different. 

Comparisons of simulated and experimental eigenvalue 
spectra and error tolerance curves are shown in SI (Fig¬ 
ures [Si] and [S4]). As discussed in SI, the various per-node 
network properties we have analyzed are largely uncor¬ 
related (Figure [S6|. 

Evolutionary trajectories 

We now consider the question of how PPI networks 
evolve in time. The present-day networks show a rich- 
get-richer structure: PPI networks tend to have both 
more well-connected nodes and more poorly connected 
nodes than random networks have. In our model, the 
rich-get-richer property has two bases: duplication and 
assimilation. The equal duplication chance per protein 
means the probability for a protein with k links to ac¬ 
quire a new link via duplication of one of its interaction 
partners is proportional to k. Likewise, the probability of 
a protein to receive a link from the first-neighbor assimi¬ 
lation probability a is proportional to its degree k. ‘Rich’ 
proteins get richer because the probability of acquiring 
new links rises with the number of existing links. 

First, we discuss two dynamical quantities for which 
experimental evidence exists: the rate of gene loss, and 
the relation between a protein’s age and its centrality. 
Gene losses in our model correspond to ‘orphan’ pro¬ 
teins which have no interactions with other proteins. As 
shown in Figure [S2} the fraction of orphan proteins grows 
quickly at first, then levels off. This is consistent with the 
findings of 1441 : in humans, while the overall duplication 
rate is higher than the loss rate, when only data from 
the past 200 Myr are considered, the loss rate is slightly 
higher than the duplication rate. In our model, after the 
initial rapid expansion, the rate of gene loss stabilizes 
relative to the duplication rate. 

We define the ‘age’ of a protein in our simulation ac¬ 
cording to the order in which proteins were added to the 
network. Our model shows that a protein’s age correlates 
with certain network properties. Consistent with earlier 
work [781176] , we find that older proteins tend to be more 
highly connected. We plotted the ‘age index’ of a protein 
(the time step at which the protein was introduced) ver- 
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FIG. 9. Gene ontology. Shown are GO-slim profiles for the 
100 oldest and 100 youngest proteins, as measured by S-value, 
in the yeast PPI network. 
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Centrality vs. age 



Age index (oldest —> newest) 


Centrality vs. age 


Age index (oldest —> newest) 


Centrality vs. age 







Age index (oldest —> newest) 


FIG. 10. Older proteins are more central. Simulations of a protein’s age index (time since introduction into the network) 
vs. degree (fc), betweenness ( b ), and closeness (£) centrality, for human (green), yeast (blue), and fly (red). The oldest proteins 
are on the left in this figure, and the proteins get younger moving to the right. There is an approximately monotonic increase 
in centrality with age. 


sus its centrality scores. As shown in Figure [lUJ the age 
index negatively correlates with degree, betweenness, and 
closeness centralities: older proteins tend to be more cen¬ 
tral than younger proteins. Figure [TO] shows our model’s 
prediction that a protein’s age correlates with degree, 
betweenness, and closeness centrality. We confirmed this 
prediction by following the evolutionary trajectories of 
individual proteins (Figure [S3] ) . These results are consis¬ 
tent with the eigenvalue-based aging method described 
in [75] (Figure [S5|. Phylogenetic protein age estimates 
indicate that older proteins tend to have a higher de¬ 
gree [73] [75], which our model correctly predicts. Inter¬ 
estingly, the eigenvalue-based scores are only modestly 
correlated with other centrality scores (0.36 degree, 0.47 
betweenness, and 0.10 closeness correlations). Using the 
eigenvalue method in tandem with our centrality-based 
method could provide stronger age-discriminating power 
for PPI networks than either method alone. 

The correlation between centrality and age suggests 
that static properties of present-day networks may be 
used to estimate relative protein ages. Suppose each nor¬ 
malized centrality score ( k' = k/ max(fc), t' = £/ max(£), 
b' = 6 /max( 6 )) represents a coordinate in a 3-D ‘cen¬ 
trality space’. We can then define a composite centrality 
score' (S) as S 2 = (k') 2 + (f) 2 + ( b ') 2 . 

Do older proteins typically have different functions 
than newer proteins? We classified S. cerevisiae pro¬ 
teins using the GO-slim gene ontology system in the Sac- 
charomyces Genome Database. As shown in Figure [9] 
GO-slim enrichment profiles were somewhat different be¬ 
tween the oldest and youngest proteins (as measured by 
their S values). Several categories which were more en¬ 
riched for the oldest proteins were the cell cycle, stress 
response, cytoskeletal and cell membrane organization, 
whereas younger proteins were overrepresented in sev¬ 
eral metabolic processes. Overall, the differences were 
not dramatic, suggesting that cellular processes generally 
require both central and non-central proteins to function. 


Consistent with this, ancient proteins tend to be centrally 
located with modules, as their betweenness values grad¬ 
ually decline over time (Figure [S3] ) . The roughly linear 
relation between degree and betweenness also suggests 
that ancient proteins do not occupy structurally ‘special’ 
positions within the network, such as stitching together 
separate modules (Table [Si] and Figure [ 6 ]). This may in¬ 
dicate that modules tend to accumulate around the most 
ancient proteins, which act as a sort of nucleus. Thus, 
ancient proteins are involved in all kinds of pathways, 
because they have each nucleated their own pathway. 


In contrast to the two dynamical quantities discussed 
so far, most structural properties of PPI networks have 
only been measured for the present-day network. Al¬ 
though our model accurately reproduces the present-day 
values of these quantities, there is no direct evidence that 
the simulated trajectories are correct; rather, these are 
predictions of our model. Figure [ 8 ] shows that both mod¬ 
ularity Q and diameter D increase with time. These are 
not predictions that can be tested yet for biological sys¬ 
tems, since there is no time-resolved data yet available 
for PPI evolution. Time-resolved data is only currently 
available for various social networks (links to websites, 
co-authorship networks, etc.). Interestingly, the diame¬ 
ters of social networks are found to shrink over time m 
Our model predicts that PPI networks differ from these 
social networks in that their diameters grow over time. 
In addition to Q and D , we tracked the evolutionary tra¬ 
jectories of several other quantities: the evolution of the 
global clustering coefficient, the rate of signal propaga¬ 
tion, the size of the largest connected component (Fig¬ 
ure S2), as well as betweenness and degree values for 
individual nodes (Figure [S3|. See SI for details. 


The number of ‘extra’ links created by each assimila¬ 
tion event should independent of N. Proteins with mul¬ 
tiple interaction partners may interact with their part¬ 
ners simultaneously (so-called ‘party hubs’) or at differ¬ 
ent times/locations (‘date hubs’) 78]. Since they bind 
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FIG. 11. Sensitivity analysis. Heat maps represent median values for 10 simulations per parameter combination of the yeast 
network. Left: (j> and a are varied, d and p values are kept fixed. Right: d and p varied, <f> and a kept fixed. 


to several partners simultaneously, party hubs typically 
have multiple binding sites, each specific to one bind¬ 
ing partner 131 • This specificity suggests that a pro¬ 
tein which evolved the capability to bind to a party hub 
would be unlikely to undergo assimilation. By contrast, 
the binding sites of date hubs are often disordered re¬ 
gions which are able to form transient interactions with 
multiple partners [SDJ EH- If a protein evolves the ca¬ 
pability to bind to a date hub, it is likely to share the 
physical characteristics of the hub’s neighbors, leading to 
assimilation. However, to avoid competition for the same 
binding site, the interaction partners of date hubs tend 
not to be coexpressed US]. One consequence of this is 
that assimilating proteins will likely only bind one of the 
target protein’s neighbors - whichever neighbor happens 
to be present at that time and place. Although the capa¬ 
bility to bind to the hub protein’s other neighbors may 
initially be present, these will presumably remain unused 
in the cell. Our expectation is that the assimilating pro¬ 
tein will therefore be unlikely to retain this capability, 
as it evolves. Similarly, only a single extra link should 
be generated at the second-neighbor level, third-neighbor 
level, etc. Consistent with the evidence discussed above, 
the number of links created by assimilation is approxi¬ 
mately independent of the total network size. Party hubs 
typically are centrally-located within modules, while date 
hubs often function to stitch together large-scale modules 
in the cell. It may be that duplication-only models are 
unrealistically fragmented (Table |ll|) because their mod¬ 
ules are not properly attached with date hubs; instead, 
the modules are disconnected components. 


Sensitivity analysis 

The DUNE model has four parameters. One parame¬ 
ter, the DU rate d, is estimated from empirical data. The 
other three are adjustable parameters: the NE rate /r, the 
divergence probability <p, and the assimilation probability 
a. To gain a better understanding of how these parame¬ 
ters affect the final network structure, starting with each 
organism’s set of parameters, we systematically adjusted 
all 4 parameters. Results are shown in Figure El 

As expected, the divergence parameter <f> was positively 
correlated with both the modularity Q and the diame¬ 
ter. When (f> ss 0, the network quickly reaches a fully- 
connected state, where all proteins are linked to all other 
proteins. Consequently, the network is not organized into 
modules, and all distances in the network are equal to 1. 
The ‘thinning out’ of duplicate links is therefore essential 
to generate non-trivial network features. The opposite 
occurs when the NE rate /i is too low: /i « 0, and the 
network evolves towards a completely disconnected state. 

Why does gene duplication lead to modularity? Con¬ 
sider an initially uniform network. When a node is dupli¬ 
cated at random, this causes the original node, the copy, 
and their immediate neighbors to share more links inter¬ 
nally than they do with the rest of the network. Sub¬ 
sequent duplications amplify this effect: if a node that 
has 10 links within a module but only 2 external links is 
duplicated, there are now 20 internal and 4 external links 
(prior to post-duplication divergence). 

Interestingly, Q has a weak negative correlation with 
the assimilation parameter a. When a is large, the prob¬ 
ability to link to distant neighbors of the target pro¬ 
tein is relatively high, and mutated proteins have a non- 
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negligible chance to generate links to proteins outside 
of their target’s pathway. This causes modules to blur 
at the edges; their member proteins will share a higher 
number of links to other modules than for a low a net¬ 
work. Although the modularity is reduced for a high a 
network, it does not disappear entirely. Similarly, there 
is a sharp decrease in Q as the NE rate surpasses the DU 
rate, indicating the important role of the DU mechanism 
in modular organization. 

Diameter is also negatively correlated with a. When a 
single NE event has a significant chance to generate links 
to the target protein’s neighbors, this tends to reduce the 
overall separation of proteins in the network. 


DISCUSSION 


The relevance of selection to PPI network evolution 
has been a topic of considerable debate [82], particularly 
in the context of higher-order network features, such as 
modularity. A number of authors have argued that spe¬ 
cific selection programs are required to generate modular 
networks, such as oscillation between different evolution¬ 
ary goals However, previous work has shown 

that gene duplication by itself, in the absence of both 
natural selection and neofunctionalization, can generate 
modular networks [89] [90i. Consistent with the findings 
of [SS] [SD], modularity in our model is primarily gen¬ 
erated by gene duplications (Figure 11 see SI for sen¬ 
sitivity analysis). Unfortunately, duplication-only mod¬ 
els err in their predictions of other network properties 
(Tables [TT| and [S2| Figure S7). A well-known problem 
with duplication models is that they generate excessively 
fragmented networks, with only about 20% of the pro¬ 
teins in the largest component. This is in sharp contrast 
to real PPI networks, which have 73% to 89% of their 
proteins in the largest component. Neofunctionalization- 
only models have most of their proteins in the largest 
component, but are significantly less modular than real 
networks. As shown in Table [TTJ by modeling duplica¬ 
tion and neofunctionalization simultaneously, the DUNE 
model generates networks which have the modularity 
found in duplication-only models, while retaining most 
proteins in the largest component. This lends support to 
the idea that gene duplication contributes to the modu¬ 
larity found in real biological networks, and that pro¬ 
tein modules can arise under neutral evolution, with¬ 
out requiring complicated assumptions about selective 
pressures. This is consistent with recent experimental 
work characterizing a real-world fitness landscape, show¬ 
ing that it is primarily shaped by neutral evolution [HD- 
One model for the fate of duplicate genes is subfunc¬ 
tionalization (SF). In SF, the original and the duplicate 
genes are both free to lose their redundant functions, so 
they can evolve freely until they exactly reproduce the 
ancestral function [92] ■ The post-duplication divergence 
in our model is similar in spirit to SF, but it differs in 
two significant ways: (1) in our model, the link loss is 


completely asymmetric, and (2) a fraction (1 — cf> ) of the 
redundant links are retained, so, unlike SF, not all of 
the redundancy is eliminated. For the first point, empir¬ 
ical evidence suggests that the divergence is asymmet¬ 
ric ED, although the assumption of complete asymmetry 
would likely need to be revisited to build a finer-grained 
model. Second, genetic regulatory networks have been 
shown to be robust to random link deletions, indicating 
that these networks retain some degree of redundancy 
[93H95] . In silico evidence suggests that a more accu¬ 
rate picture may be of a transient period of functional 
divergence, followed by prolonged neofunctionalization, 
resulting in only a partial loss of redundancy [20 • This 
is consistent with our model. 

One example of a known biological mechanism which 
should lead to assimilation is domain shuffling, the copy- 
and-pasting of part of one protein into another L 97j (98]. 
The neofunctionalization mechanism described here is 
quite general, and includes domain shuffling, among 
other methods of PPI creation. A PPI formed via domain 
shuffling will often be the result of a binding site duplica¬ 
tion. Assuming the binding is due to simple surface simi¬ 
larity, the initial link will be to the protein which had its 
domain copied. The likelihood of binding to neighbors 
of the original protein should depend only on the prob¬ 
ability that each interaction is due to surface similarity 
because the copied binding site will be identical to the 
original. 

The role of domain shuffling in assimilation raises the 
question of whether domains should be modeled explic¬ 
itly, rather than representing proteins as integral units. 
Previous work indicates that overall PPI network topol¬ 
ogy is robust to the details of domain shuffling 179] . 
Moreover, while proteins which have experienced domain 
shuffling have a higher average degree than other pro¬ 
teins, high- and low-degree proteins are equally likely to 
acquire new interactions this way m- Because the cre¬ 
ation of new links by domain shuffling should be topolog¬ 
ically very similar to the creation of new links by other 
neofunctionalization events, we believe our model is a 
reasonable implementation of this mechanism, as it ap¬ 
plies to the evolution of network topology. 

Previous estimates of NE rates in eukaryotes have var¬ 
ied widely, generally falling in the range of 100 to 1000 
changes/genome/Myr [24] 06[ [52j, or on the order of 0.1 
change/gene/Myr. However, more recent empirical work 
has identified several problems with the methods used to 
obtain these estimates, suggesting that de novo link cre¬ 
ation is much less common than previously thought 1481 . 
This is consistent with our model. The best-fit values of 
our NE rate /i are in the range of 10 -5 to 10 _4 /gene/Myr 
(Table [i]), which in all three organisms are considerably 
slower than the duplication rates d. 

Biologically, many of the interactions created by our 
neofunctionalization mechanism are expected to initially 
be weak, non-functional interactions. The results of m 
suggest that strong functional interactions are correlated 
with hydrophobicity, which in turn is correlated with 
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promiscuity. We posit that initially weak, non-functional 
interactions are an essential feature of PPI evolution, as 
they provide the ‘raw material’ for the subsequent evolu¬ 
tion of functional interactions. If this reasoning is correct, 
one consequence should be that hub proteins are, on av¬ 
erage, more important to the cell than non-hub proteins. 
This has been found to be true: both degree m and be¬ 
tweenness centrality |68) have positive correlations with 
essentiality, indicating that hub proteins are often critical 
to the cell’s survival. 

We have described here a model for how eukaryotic 
protein networks evolve. The model, called DUNE, im¬ 
plements two biological mechanisms: (1) gene duplica¬ 
tions, leading to a superfluous copy of a protein that 
can change rapidly under new selective pressures, giving 
new relationships with other proteins and (2) a protein 
can undergo random mutations, leading to neofunction¬ 
alization, the de novo creation of new relationships with 
other proteins. Neofunctionalization can lead to assim¬ 
ilation, the formation of extra novel interactions with 
the other proteins in the target’s neighborhood. Bio¬ 
logical evidence suggests that this type of mechanism 
exists. Our specific implementation is based on a sim¬ 
ple geometric surface-compatibility argument for the ob¬ 
served transitivity in PPI networks. This is, of course, a 
heavily simplified model of PPI network evolution, and 
there are many biological factors which have not been 
included. However, our relatively simple model shows 
good agreement with 10 topological properties in yeast, 
fruit flies, and humans. One finding is that PPI networks 
can evolve modular structures, just from these random 
forces, in the absence of specific selection pressures. We 
also find that the most central proteins also tend to be 
the oldest. This suggests that looking at the structures of 
present-day protein networks can give insight into their 
evolutionary history. 


Methods 

Genome-wide PPI screens have a high level of noise 
[19j . and specific interactions correlate poorly between 
data sets [53]. We found that several large-scale features 
differed substantially between types of high-throughput 
experiments (see SI). Due to concerns about the accuracy 
and precision of data obtained through high-throughput 
screens, we chose to work with ‘high-confidence’ data 
sets consisting only of pairwise interactions confirmed in 
small-scale experiments, which we downloaded from the 
public HitPredict database m ■ We found sufficient 
high-confidence data in yeast (S. cerevisiae ), fruit flies 
( D. melanogaster), and humans ( H. sapiens). 

All simulations and network feature calculations were 
carried out in Matlab. Our scripts are freely available 
for download at http://interacto.me. We computed 
betweenness centralities, clustering coefficients, shortest 
paths, and component sizes using the MatlabBGL pack¬ 
age. Modularity values were calculated with the algo¬ 


rithm of ( l03j . All comparisons (except the degree dis¬ 
tribution) are between the largest connected components 
of the simulated and experimental data. 

Due to the human network’s somewhat larger size, 
most dynamical features were calculated once per 50 time 
steps for the human network, but were updated at every 
time step in the yeast and fly networks. For dynamical 
plots, the y coordinates of the trend line are medians-of- 
medians. The amount of time elapsed per time step (the 
x coordinate) varies between simulations. We binned the 
time coordinates to the nearest 10 million years for yeast 
and fly, and 25 million years for human. When multiple 
values from the same simulation fell within the same bin, 
we used the median value. We then calculated the me¬ 
dian value between simulations. Scatter plot trend lines 
are calculated in a similar way. The trend line repre¬ 
sents the median response variable (C, 6, or t) value over 
all nodes within a single simulation with degree k. The 
y coordinate of the trend line is therefore the median 
(across 50 simulations) of these median response vari¬ 
ables. This median-of-medians includes all simulations 
that have nodes of a given degree. 


Empirical data 

We downloaded large-scale data sets from BioGRID 
una, and used the Wilcoxon rank-sum test to compare 
aggregate statistical features across various experimental 
types in yeast (5. cerevisiae) and humans ( H. sapiens) 
(ZB). As expected, we found that data obtained by affin¬ 
ity capture was significantly different than pairwise ex¬ 
perimental data (primarily yeast two-hybrid and in vitro 
complexation), as the affinity capture interactions repre¬ 
sent entire complexes, which is somewhat different infor¬ 
mation than the pairwise interactions we are attempting 
to capture using our model. However, more surprisingly, 
the only feature to show significant agreement between 
pair-wise techniques was the eigenvalue distribution of 
the walk matrix (P > 0.05). Further sub-dividing the 
individual techniques into smaller data sets containing 
only results obtained in single experiments, we discov¬ 
ered that, again, only the spectra agreed between differ¬ 
ent screens. 

Note that, due to the small size of the fly network, 
there may be too many missing links to obtain an accu¬ 
rate description the network’s large-scale topology. Al¬ 
though, by appropriate parameter tuning, our model is 
able to accurately reproduce the fly network, it is possi¬ 
ble that different parameters will be required to match 
the fly network once it becomes more fully character¬ 
ized experimentally. The data sets considered here do 
not include interactions which are enabled through post- 
translational modifications. Although these data sets 
are far from complete, and may be susceptible to false¬ 
positive detections, these appear to be the most accurate 
data available at the present time. 
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ated by NE per protein is only about 8% higher in hu¬ 
mans than in yeast [51]. These results support the con¬ 
jecture that the probability for a protein to receive a new 
link via point mutation is approximately independent of 
N, as previously noted in [35]. Due to the finite copy 
number of proteins, as well as the compartmentalization 
of eukaryotic cells, we regard it as unlikely that proteins 
will simultaneously acquire multiple links to targets in 
different locations in the cell, or which are involved in 
divergent biological processes. 


Large-scale duplications 

Our model does not explicitly consider simultaneous 
duplication of multiple genes (chromosomal duplications, 
whole genome duplications, etc.). However, as shown in 
Table [T} duplication rates in our model are considerably 
higher than neofunctionalization rates, so that, on aver¬ 
age, there are multiple duplications per neofunctionaliza¬ 
tion event. Sequential duplications of this type may be 
thought of as an (imperfect) representation of multi-gene 
duplications. The advantage of this approach is that we 
do not require separate rates for each duplication scale 
(one could imagine an extremely detailed model which in¬ 
cluded separate rates for gene duplication, gene-pair du¬ 
plication, gene-triplet duplication, etc.). The downside 
is that our implementation of larger-scale duplications 
will generally include some genes which have been du¬ 
plicated multiple times, and others which have not been 
duplicated at all. A potential mitigating factor is that 
the rate of gene loss (and evolution in general) follow¬ 
ing genome duplication is very high [3D, [SB], so even a 
completely faithful large-scale duplication would likely 
be altered within short order. 


SUPPORTING INFORMATION 
Randomness conjecture 

We describe here a test of our randomness conjecture, 
q = 1/N. The implication is that the average number 
of NE links per protein should be independent of N. 
Another possibility is that q is constant, implying that 
the number of NE links per protein is proportional to 

N. To test this, we compared NE rates in the human 
and yeast PPI networks. The total number of proteins 
in humans is estimated to be 22740 [ 105] and in yeast, 
5616 [106] . The number of mutations to coding DNA is 
approximately 0.004/genome/replication in humans and 

O. 0027/genome/replication in yeast 107 . If the number 
of new links is proportional to N, then, based on the num¬ 
ber of proteins and the mutation rate, there should be 
roughly 600% more links created by NE in humans than 
in yeast. However, by counting the number of nonre- 
dundant interactions in duplicate gene pairs, it has been 
shown empirically that the average number of links cre- 




FIG. SI. Walk matrix eigenvalues. Shown are eigenvalue 
(A) distributions in human (green), yeast (blue), and fly (red). 
Heavy lines are the median values from 50 simulations, and 
light lines are results of individual simulations. 
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FIG. S2. Dynamical features. Shown are the evolution of (A) the largest component size, (B) the fraction of orphan proteins, 
(C) the global clustering coefficient, and (D) the second-largest eigenvalue of the walk matrix, in human (green), yeast (blue), 
and fly (red). Light lines indicate the evolutionary trajectories of 50 individual simulations, and the heavy line is the median 
value. Empirical data values are shown as a dashed line, where available. 


Eigenvalues 

The connectivity of a network can be expressed by its 
adjacency matrix , an N x N matrix A, in which the en¬ 
tries Aij equal 1 if a link exists between proteins i and j, 
and 0 otherwise. If A is normalized by column, then the 
entries describe the rates of a transition from i to j in one 
time step. The distribution of eigenvalues p( A) is called 
the network’s spectrum (Figure [Si] ). This matrix and its 
eigenvalues can be interpreted in terms of a process in 
which a random walker starts on one node i, and, over a 
series of time steps, reaches another node j. Intuitively, 
this can be thought of as a signal propagation rate: if 
one protein is affected by an external signal, how long 
does it take that signal to diffuse through the network? 
The eigenvalues A of this ‘walk matrix’ describe the rate 
at which a random walk on the network reaches steady- 
state. The second-largest eigenvalue (A 2 ) determines the 
rate of convergence of the random walk (Figure [S2|. A 


larger value of A 2 indicates a slower signal propagation 
rate. 


Error tolerance 

We measured the ‘error tolerance’ as described in [251 : 
we examined the decrease of /i when nodes (and their ac¬ 
companying edges) were deleted from the network either 
(1) at random, or (2) according to their degree, start¬ 
ing with the most well-connected node. Results for the 
simulated and experimental networks were very similar 
(Figure [S4|. 

Simulation length: early versus late evolution 

The total time elapsed during our simulations varies 
considerably, with yeast and human simulations running 
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FIG. S3. Individual protein centrality scores. Evolution of degree (A) and betweenness (B) for proteins introduced to the 
network at different times in humans (top), yeast (middle), and flies (bottom). The 1st protein (one of the two initial proteins) 
is shown in red, the 6th protein in black, the 11th protein in blue, and the 101st protein in green. Curves are median values 
from 50 simulations. 


about 1 to 2 billion years, and the fly simulations about 
5 billion years. This is compared to the rough esti¬ 
mate of 3.5 billion years since the origin of life on Earth 
11081 . The exceptionally long duration of the fly sim¬ 
ulations are due to the very low gene duplication rate 
(d = 0.001/gene/Myr). The aim of our model is to 
describe the evolution of PPI networks with all their 
present-day machinery. Gene duplication, in the form 
in which it exists today, certainly would not have existed 
at the origin of life! The initial state in our model con- 



0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

fraction of nodes deleted 


FIG. S4. Error tolerance. Shown are error tolerance curves 
in human (green), yeast (blue), and fly (red). Circles indi¬ 
cate proteins deleted randomly, and squares indicate proteins 
deleted starting with the most well-connected protein and re¬ 
moving proteins in descending order. 


sists of two interacting proteins. Biologically, these are 
two polypeptides (or, more likely, RNA molecules) in a 
pre-biotic soup, that happen to interact in a way that is 
mutually beneficial. Each of these molecules has the abil¬ 
ity to replicate. This autonomous replication of individ¬ 
ual proteins corresponds to ‘gene duplication’ in the very 
early stages of evolution. However, this is a very different 
conceptual underpinning for the duplication mechanism, 
and it seems unlikely to share the present-day values of 
the duplication rate. Because, in the early stages of evo¬ 
lution, each time step represents a very long duration in 
real time, it is likely that this accounts for the discrep¬ 
ancy in total time elapsed. 

Fitting functions 

The degree distribution obeys a p ower law in its tail, 
p(k) ~ fc -7 [1~09] , with 7«3 (Table [Si]), implying that 
hub proteins are more common than would be expected 
for a randomly connected network, which would have an 
exponentially decaying p{k). The closeness distribution 



7 (3 £ a 8 

Yeast 

Fly 

Human 

2.8(2) 2.4(1) 1.8(2) 1.2(2) 0.32(6) 
3.1(2) 2.2(4) 0.8(5) 0.8(7) 0.0(3) 
2.8(1) 2.3(1) 1.5(3) 1.3(2) 0.0(1) 


TABLE SI. Scaling exponents. Distributional exponents 
( p(k ) ~ k y , p(b) ~ 6~ 8 ) were estimated using the maxi¬ 
mum likelihood method of [109] , Other exponents (C ~ k~ 
b ~ k a , n ~ k~ s ) were estimated using nonlinear regression. 
Due to the relatively small sizes of the data sets, there is 
considerable uncertainty in these estimates. 
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p(() is approximately Gaussian, with mean 0.17 and stan¬ 
dard deviation 0.03 in humans, mean 0.19 and standard 
deviation 0.03 in yeast, and mean 0.13 and standard de¬ 
viation 0.03 in flies. Closeness is a measure of distance, 
indicating that the distances within the network are es¬ 
sentially a random walk in ‘node space’. The betweenness 
distribution also follows a power law in its tail. This is 
an indication of modular structure, due to the overrep¬ 
resentation of ‘bridge’ proteins, relative to a randomly 
connected network. 

All species examined show a power law decay in clus¬ 
tering coefficient as a function of degree, C ~ k~ 
Poorly-connected proteins therefore tend to have higher 
clustering coefficients, meaning that a greater fraction of 
their neighbors are mutually connected. 

Disassortative mixing was quantified for the yeast PPI 
network in |64j as a power law decrease in median neigh¬ 
bor degree, n ~ k~ s . This is consistent with our data, 
although the very small estimated value of S = 0.32 indi¬ 
cates only a slight negative relation (Table |S1[ ) . Interest¬ 
ingly, 6 = 0 in both human and fly networks, indicating 
that disassortativity may be a trait unique to the yeast 
network. 


Principal component analysis 


We examined six features which calculate a value for 
each node in the network: degree centrality, clustering 
coefficients, closeness centrality, eigenvalue spectrum, be¬ 
tweenness centrality, and mean nearest-neighbor degree. 
To quantify the independence of these features, we used 
principal component analysis (PCA) 1 10 . Each feature 
assigns a value to each node in the network, giving a 
6 x N data matrix, where each row represents a feature 
(signal), and each column is a node (sample). We sub¬ 
tract the mean and divide by the standard deviation of 
each row. This results in a standardized data matrix, de¬ 
noted by Y. The 6x6 correlation matrix for each species 
is defined as Ce jW_YY t : 
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FIG. S5. Laplacian eigenvector participation. Elements 
of the eigenvector of the Laplacian matrix (defined as K — A, 
where K is a diagonal matrix with the degree of node i as el¬ 
ement Ka) associated with the largest eigenvalue vs. protein 
age index (time of introduction) in the yeast simulation. De¬ 
tails of this method are discussed in [76! . Heavy lines are the 
median values from 50 simulations, and light lines are results 
of individual simulations. The inset plot shows the trend line 
with a rescaled y-axis. 
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The entries of each C are (from left-to-right, and top- 
to-bottom): degree centrality, clustering coefficients, 
betweenness centrality, closeness centrality, eigenvalue 
spectrum, and mean nearest-neighbor degree. Many of 
the off-diagonal elements of the C matrices are close to 
zero, suggesting that the features are to a large extent 
independent of one another. 

To perform PCA, we diagonalized each correlation ma¬ 
trix, 


C = SAS t , 


(7) 


where A is a diagonal matrix of eigenvalues and S has the 
eigenvectors of C as its columns. As shown in Figure [S6j 
the degree and betweenness show similar loadings on the 
first two principal components, reflecting the nearly lin¬ 
ear relation between these centrality scores (Figure [6]). 
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FIG. S6. Principal component analysis. Shown are the factor loadings and scores on the first two principal components. 
Data scores are shown in red, and blue lines represent feature loadings. 


The eigenvalue matrices A are given by: 


Comparison to other models 


Ai, = 


2.27 


1.23 


1.02 


0.98 


0.40 


0.11 


( 8 ) 


Ay = 


2.20 


1.30 


1.01 


0.98 


0.43 


0.08 


(9) 


Af = 


1.95 


1.24 


1.13 


0.91 


0.44 


0.32 


( 10 ) 


(Zeros have been suppressed for clarity.) The fraction 
of variance explained by the zth principal component is 
given by A t ,/ ]Tb A jj. The closer the number of compo¬ 
nents required to explain most of the variance is to the 
total number of input signals, the more independent the 
signals are. In yeast and humans, 4 components are re¬ 
quired to explain 90% of the variance; in fruit flies, it re¬ 
quires 5 components. Linear transformations are able to 
only modestly reduce the dimensionality of the problem, 
suggesting that each feature contributes unique informa¬ 
tion about the network’s structure. This does not, of 
course, rule out the possibility of the existence of other in¬ 
dependent, informative features, a far more complicated 
question which is outside the scope of this current work. 


Our model is rooted in previous modeling efforts. The 
basic framework for our model combines the gene dupli¬ 
cation mechanism described in |29j | with a link creation 
mechanism inspired by [52] . The principal difference be¬ 
tween our model and previous models is that our model 
considers duplication and mutation simultaneously. The 
previous models we examined attempted to construct the 
PPI network from a single mechanism. Another signif¬ 
icant difference is our assimilation mechanism. To the 
best of our knowledge, previous work has not explicitly 
modeled proteins integrating into biological pathways. 

We compare the DUNE model to four models previ¬ 
ously proposed for PPI networks. Two were evolution¬ 
ary models: (1) the Vazquez model of DU followed by 
rapid loss-of-function mutations [29] and (2) the Berg 
‘link dynamics’ model of point mutations coupled with 
a PA-like ‘rich-get-richer’ rule for assigning new interac¬ 
tions [52] . (A slightly different DU model is presented 
by Pastor-Satorras m ■ However, because the Vazquez 
model has been shown to be a better fit to experimen¬ 
tal data m, we have limited our DU-only comparison 
here to the Vazquez model.) Two others were static mod¬ 
els (models of present-day networks that do not simulate 
the network’s evolutionary path) that consider the pri¬ 
mary organizing principle to be nonspecific interactions 
between proteins: (1) random geometric (RG), a mathe¬ 
matical model where proteins are randomly scattered in a 
2 to 4 dimensional box, and any proteins close enough to 
one another form an interaction [52j and (2) the ‘MpK’ 
desolvation model, which assigns interactions based on 
proteins’ exposed hydrophobic surface areas (M]. For 
reference, we also calculated results for an Erdos-Renyi 
(ER) random graph with N and ( k ) set by the data [55]. 

These models were originally validated against differ¬ 
ent features of the empirical network, making it difficult 
to directly compare them. To characterize these mod¬ 
els in greater detail, we coded each of these models, and 
ran 50 simulations of each model with identical parame¬ 
ters and starting conditions. Using Matlab, we coded the 
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FIG. S7. Model comparison. Comparison of five other models to the yeast PPI network: Vazquez [29] (green), Berg [52[ 
(red), random geometric [S3] (dark blue), MpK desolvation [53j (purple), and ER random graph [55] (brown). For reference, 
DUNE model results are shown as a black line. Dots represent high-confidence experimental yeast data, and solid lines are 
median values over 50 simulations. 


Vazquez [25] . Berg [52], RG [53], and MpK [51] models 
as described in the original papers. Since each model was 
originally parametrized for older yeast PPI data sets, we 
re-optimized the parameters for our yeast data as follows. 
We used a Monte Carlo simulation to adjust each model’s 
parameters to minimize the total symmetric mean abso¬ 
lute percentage error values (SMAPE; see below) for the 
yeast HitPredict data set. 

For the Vazquez model, we used a value of 0.582 for the 
post-duplication divergence probability, and a value of 
0.083 for the dimerization probability. As noted by pre¬ 
vious authors, duplication-only simulations produce net¬ 
works which are extremely fragmented [89]. We observed 
that the Vazquez simulations typically had around 20% 
of their nodes in the largest connected component (Ta¬ 


ble 0 - Since most of the network features we examined 
are limited to the largest component, in order to make a 
reasonable comparison of the Vazquez simulation results 
to the data, we allowed the simulated network to grow 
until its number of links met or exceeded 5 times the num¬ 
ber of links in the data, K > 5A'd a ta- Since the largest 
component is not always exactly 20% of the total nodes, 
this stopping condition is somewhat arbitrary; however, 
results for this model seem robust to small changes in 
the stopping condition. For the Berg model, we used the 
empirically estimated duplication rate of 0.01/gene/Myr, 
and found best-fit values of 24.5/gene/Myr for the muta¬ 
tion rate, and Ajjata — 98 proteins for the initial network 
size. For the RG model, we used a 45.5 x 45.5 x 45.5 
‘box’ with a maximum interaction radius of 3.92. For 
























22 



p(k) 

p{l) 

p(b) 

C(k) 

n(k ) 

b(k) 

pW 

E.T. 

E.T. (fc) 

p(n) 

m 

Total 

DUNE 

0.47 

0.37 

0.36 

0.60 

0.29 

0.24 

0.14 

0.10 

0.20 

0.56 

0.14 

3.48 

Vazquez 

0.52 

0.81 

0.39 

0.62 

0.27 

0.34 

0.12 

0.10 

0.15 

0.50 

0.26 

4.11 

Berg 

0.70 

0.45 

0.72 

0.67 

0.15 

0.19 

0.29 

0.12 

0.30 

0.86 

0.06 

4.51 

RG 

0.81 

0.98 

0.61 

0.83 

0.22 

0.23 

0.33 

0.31 

0.45 

0.89 

0.49 

6.15 

MpK 

0.82 

0.79 

0.68 

0.89 

0.88 

0.58 

0.18 

0.18 

0.18 

0.82 

0.16 

6.18 

ER 

0.83 

0.70 

0.78 

0.53 

0.22 

0.41 

0.28 

0.07 

0.32 

0.93 

0.07 

5.14 


TABLE S2. Symmetric mean absolute percentage error (SMAPE) of simulation versus experiment in yeast (Eq. 111. ‘E.T.’ is 
the error tolerance curve with random protein removal, and ‘E.T. ( k )’ is the error tolerance curve with highest-degree proteins 
removed first. ‘DUNE’ is the model described here, ‘Vazquez’ is the DU-only model of [29], ‘Berg’ is the link dynamics model 
[52] , ‘RG’ is random geometric [53l, ‘MpK’ is the physical desolvation model presented in [54], and ‘ER’ is an Erdos-Renyi 
random graph [55]. For each comparison, the lowest value is shown in bold. 


the MpK model, the number of exposed surface residues 
was 19, the fraction of exposed hydrophobic residues was 
M = 0.230±0.110 (mean ± standard deviation), and the 
best-fit linear equation relating M to the binding thresh¬ 
old was 1.09M + 1.04. 

The Vazquez simulations were initialized with 2 con¬ 
nected nodes, and the simulation was allowed to run un¬ 
til K > 5A'uata- The Berg simulations were initialized 
with IVdata — 98 randomly connected nodes, then run un¬ 
til N = IVdata* The RG and MpK models (which are not 
evolutionary models and therefore create the network all 
at once) were set up as described in the original papers. 

To characterize the networks, we computed several net¬ 
work properties: 


Single-value:: modularity Q , diameter D , fraction of 
nodes in the largest component f \, global clustering 
coefficient (C), and average protein degree in the 
largest component (k) (Table [IT) 


Distributional:: degree p(k), betweenness p(b), close¬ 
ness p(t), eigenvalue p{ A), and nearest-neighbor de¬ 
gree p{n) distributions 


Scatter plot:: closeness vs. degree £(k), clustering coef¬ 
ficient vs. degree C(k), betweenness vs. degree b(k), 
median nearest-neighbor degree vs. degree n(k), 
and error tolerance curves 


and compared these features to those of empirical data 
from yeast. As shown in Figure S7 we found that none 


of the previous models capture the full set of network 
properties. 

To quantify agreement with the data for non-single¬ 
value features, we calculated the symmetric mean abso¬ 
lute percentage error (SMAPE) between simulation and 
experiment [ 1 1 3 1114] : 


*t ^ | 7/ . _ „.dataI 

SMAPE = - V 

Y -* oi. u/data 

1 a yi ' yi 


(ii) 


where Y is the number of data points, and yi and y\ 


data 


denote the ith point of the response variable (Table S2) 
in the simulated and experimental data, respectively. For 
the distributional features, Y is the number of bins (ar¬ 
bitrarily chosen to be 100) minus the number of bins 
in which yi + y^ata _ q For non -distributional (scatter 
plot) features, Y is the number of k values with values for 
both simulation and experiment. There are many possi¬ 
ble measures of accuracy (such as the widely-used root 
mean squared error); we used SMAPE for two reasons. 
First, because it relies on absolute value, SMAPE does 
not over-emphasize the impact of outliers. Second, di¬ 
viding by yi + yf ata ensures that the magnitude of the 
response variable does not overwhelm the sum. This is 
significant for the non-distrbutional features. For exam¬ 
ple, in a plot of betweenness vs. degree (Figure[6]), we are 
just as interested in the overlap of the low-betweenness, 
low-degree region of the curve as we are with the high- 
betweenness, high-degree region. SMAPE values are col¬ 
lected in Table [S2] As shown in Tables [TT] and [S2j while 
previous models accurately reproduce certain features of 
the PPI network, only the DUNE model provides a rea¬ 
sonable across-the-board fit. 












